Introduction to Open Data Science - Course Project

Chapter 1: About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

today <- date()
paste("Today is", today, ". Hello R! Hello world!")
## [1] "Today is Thu Dec  3 19:02:54 2020 . Hello R! Hello world!"

My background and expectations

How I found out about this course: I came across this course last year (unfortunately I didn’t have the time to participate) and was intrigued by the contents and impressed by the fabulous feedback.

My background and expectations: As a graduate student in physics I have grown quite fond of using Python for my data analysis work. Now I am excited to see and compare what is possible with R, and I am looking forward to participating in this interesting course! So far, the “warming up” week went fine and the introductions were easy to follow. The syntax shown so far didn’t look too complicated and was partly familiar from Python and Python libraries such as pandas and matplotlib. As I am quite used to Python (and it is used more and more in the natural sciences), I don’t expect to switch to R after the course, but I would love to learn more about it and gain an understanding of its pros and cons. So if something comes up in the future for which R is better suited, I will be able to use it, or – if collaborators use it – to understand it. And furthermore, it’s always fun to learn new skills. :)

Practicalities

Link to my GitHub repository : https://github.com/kirstef/IODS-project

Looking forward to next week!


Chapter 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

In this chapter we will explore the dataset from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt, first taking a closer look at the data frame and then visualizing some interesting features.

Data wrangling

REMARK: The data wrangling part is done in the .R file in the data folder. Still, I will show a preview of the original dataset in this diary, as it illustrates the development from the original data to the dataset we finally evaluate.
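The wrangling script essentially follows the course material: the many Likert-scale question columns are combined into averaged scores (deep, stra, surf). Below is a minimal sketch of that idea with a toy data frame; the column names and group assignments are illustrative only, not the exact ones from the script.

```r
# Toy stand-in for a few Likert-scale question columns (values 1-5).
# Column names and group assignments are illustrative only.
raw <- data.frame(D03 = c(4, 4), D06 = c(4, 2),   # "deep approach" questions
                  ST01 = c(4, 4), ST04 = c(4, 3), # "strategic" questions
                  SU02 = c(2, 2), SU05 = c(2, 4), # "surface" questions
                  Points = c(25, 12))

# average each question group into one combined score per student
raw$deep <- rowMeans(raw[, c("D03", "D06")])
raw$stra <- rowMeans(raw[, c("ST01", "ST04")])
raw$surf <- rowMeans(raw[, c("SU02", "SU05")])

raw[, c("deep", "stra", "surf", "Points")]
```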

Looking at the original dataset

First let’s read in the full learning2014 data as a dataframe from the given website.

lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
head(lrn14)
##   Aa Ab Ac Ad Ae Af ST01 SU02 D03 ST04 SU05 D06 D07 SU08 ST09 SU10 D11 ST12
## 1  3  1  2  1  1  1    4    2   4    4    2   4   4    3    3    2   3    3
## 2  2  2  2  2  1  1    4    2   4    4    4   2   3    4    4    1   4    1
## 3  4  1  1  1  1  1    3    1   4    4    2   3   4    1    3    1   4    4
## 4  4  2  3  2  1  1    3    3   4    4    3   4   4    2    3    1   3    3
## 5  3  2  2  1  2  1    4    2   5    3    4   4   4    3    4    2   4    2
## 6  4  2  1  1  1  1    4    3   5    4    3   5   5    4    4    1   5    3
##   SU13 D14 D15 SU16 ST17 SU18 D19 ST20 SU21 D22 D23 SU24 ST25 SU26 D27 ST28
## 1    3   4   3    2    3    2   4    2    3   3   2    2    4    4   4    4
## 2    3   2   3    4    4    2   3    1    2   2   3    4    2    4   2    2
## 3    2   4   2    3    3    1   4    3    2   4   3    3    4    4   3    5
## 4    2   4   3    2    3    1   3    3    3   3   3    2    3    2   3    3
## 5    3   4   3    3    4    1   4    3    2   3   3    4    4    3   3    5
## 6    1   5   4    2    3    2   4    3    4   5   4    2    4    2   5    4
##   SU29 D30 D31 SU32 Ca Cb Cc Cd Ce Cf Cg Ch Da Db Dc Dd De Df Dg Dh Di Dj Age
## 1    3   4   4    3  2  4  3  4  3  2  3  4  3  4  4  5  4  2  4  3  4  4  53
## 2    3   3   4    5  4  4  4  5  5  3  2  4  4  3  3  4  3  2  3  3  2  4  55
## 3    2   4   3    5  3  5  4  4  3  4  4  2  1  4  4  1  4  1  3  1  1  5  49
## 4    3   4   4    3  3  4  4  4  3  4  4  3  2  4  5  2  5  1  5  4  2  5  53
## 5    3   3   4    4  2  4  4  3  3  3  4  4  3  4  4  4  4  2  5  5  3  3  49
## 6    2   5   5    3  3  5  4  4  3  4  5  4  3  5  4  4  4  3  4  3  3  5  38
##   Attitude Points gender
## 1       37     25      F
## 2       31     12      M
## 3       25     24      F
## 4       35     10      M
## 5       37     22      M
## 6       38     21      F

Even a quick look at the head of the data frame makes it clear that some data wrangling will help in understanding the data better. It also shows how valuable metadata is: just looking at the variable names does not tell us what they mean.

Data analysis

Preparing an analysis sub dataset for further data exploration

For this we have to include the library dplyr (library(dplyr)), a common R library used for data wrangling.

library(dplyr)

1. Looking at the new data subset: Reading in the students2014 data

First let’s use read.csv() or read.table() to read in the data subset created before and saved as a .csv in the project data-subfolder.

students2014 <- read.csv("./data/learning2014.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

With head(students2014) we can display the first six rows of the subset. It is now much easier to read and interpret than before.

##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

Structure and dimension of the subset

With dim() and str() we can further explore the dimension and structure of the dataset.

dim(students2014)
## [1] 183   7
str(students2014)
## 'data.frame':    183 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

dim() already tells us that the subset consists of 7 variables and 183 observations. str() gives further information: the variables are those that we defined before (cf. R script) and consist of 4 combined variables of numeric type, which give the grade the students obtained in the respective categories on a scale of 0-5; 2 integer variables giving the age and the points; and one factor variable stating the gender.

Getting rid of 0 values

We can filter our subset further, e.g. to exclude the students who didn’t participate in the final exam, and afterwards look at the dimensions again.

# select rows where points are greater than zero
students2014 <- filter(students2014, points > 0) 
dim(students2014)
## [1] 166   7

As one can see, we have \(183-166=17\) rows with 0 points, i.e. students who did not take the final exam. Let’s keep the filtered dataset for the further exploration, as the reasons for not taking the exam are unknown to us and thus do not necessarily relate strongly to attitude or the other variables.

2. Visualizations: Graphical overview of the data

Using ggpairs gives us a good graphical overview of the data and the correlations of the different variables with each other. Using col = gender in the mapping argument lets us also see gender differences (color-coded: female red, male blue).

library(ggplot2)
library(GGally)
# create a plot matrix with ggpairs()
ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

Studying the plot in more detail, we see the following parts: scatter plots between the different variables, distributions of the individual variables, and correlation values between the different variables. The higher the absolute correlation value, the stronger the correlation between the respective variables. As we can see, the highest correlation seems to be between points and attitude. Looking at the scatter plot between these two variables, we can also see that the data points are roughly scattered along a line, which also hints at a relation between them. The distribution of attitude shows a visible gender difference: whereas the distribution for the female students rises to about 2.5 and then almost forms a “plateau” (until 3.5), the distribution for the male students shows only a small fraction of students at grade 2 or below, followed by a steep rise to a peak at a bit over 3. The surface question distribution (“surf”) peaks at about 3 for the female students, which is higher than for the male students. For the age one can see (as expected) that most of the students are between 20 and 30 years old.

summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The summary() command gives us an overview of the statistics of the different variables. We have about twice as many female students as male students, the mean age is between 25 and 26 years, and the mean grade for the question types (deep, strategic, surface) lies between 2.8 and 3.7, with deep questions scoring highest. The mean attitude lies at about 3.1 and the mean points at about 23. Apart from these means, we get further information: the minimum and maximum values, the 1st and 3rd quartiles, and the median.

3. Regression model

From the pairs plot above let’s choose the points as our target value and the 3 variables with the highest correlation to points as explanatory variables to be used for the regression model: exam points ~ x1 + x2 + x3. The respective variables would be attitude (corr: 0.437), stra (corr: 0.146) and surf (corr: -0.140).
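As a sanity check, this correlation-based selection can be reproduced with cor() alone. The sketch below uses synthetic data (not the course data) just to show the mechanics of ranking candidate variables by their absolute correlation with the target:

```r
# Synthetic example: one truly related variable, two noise variables.
set.seed(1)
df <- data.frame(attitude = rnorm(100), stra = rnorm(100), surf = rnorm(100))
df$points <- 3 * df$attitude + rnorm(100, sd = 2)

# correlation of each candidate with the target, ranked by absolute value
cors <- sapply(df[, c("attitude", "stra", "surf")], cor, y = df$points)
sort(abs(cors), decreasing = TRUE)  # the truly related variable ranks first
```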

summary(my_model) gives us further information on the validity and significance of our model; the interpretation follows below.

# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

To find out more about the significance of the correlations, we can look at the t-value and Pr(>|t|). The t-value grows in magnitude when the null hypothesis is not true; the null hypothesis in this case is that there is no relation between the variable and the target value. The t-value on its own is difficult to interpret, but Pr(>|t|) gives us the probability of getting a t-value at least this extreme if the null hypothesis were true. As one can see, this probability is very low for attitude, which is also indicated by the *** next to the value (described as significance codes in the legend). Thus, a relation between points and attitude is quite evident. Both stra and surf have no significant relation to the target variable, so I will remove them from the model.
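The connection between the t-value and Pr(>|t|) can be verified by hand: for attitude the summary reports t = 5.913 on 162 residual degrees of freedom, and the two-sided p-value is twice the tail probability of the t-distribution:

```r
# two-sided p-value for the attitude coefficient reported above
t_val <- 5.913
df_resid <- 162
p <- 2 * pt(-abs(t_val), df = df_resid)
p  # ~1.9e-08, matching the summary() output
```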

Just for fun, here is the model once more with only attitude and stra:

my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The t-value for stra is a bit higher and its Pr(>|t|) a bit smaller, but the relation is still not significant.

4. Reduced regression model: Summary of fitted model

After also removing the stra variable, the model is reduced to the following one:

my_model3 <- lm(points ~ attitude, data = students2014)
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The values now show a clear significance of the relation between points and attitude. So what does that mean? Apparently, the attitude towards the course influences the final result in the exam (points). That relation is not really surprising: if you like a course and are interested in it, you are probably studying more for it and will find it easier to get more points. The fitted model can now be used to predict the points of a student with attitude = x.

Prediction of points for students with different attitudes

new_students <- c("Student X" = 2.5, "Student Y" = 4.8)
new_data <- data.frame(attitude = new_students)

# Predict the new students exam points based on attitude
predict(my_model3, newdata = new_data)
## Student X Student Y 
##  20.45080  28.55936

Plotting the two variables against each other, we can see that the predicted points fall right on the regression line (which makes sense, because they come from the same model). However, there is still a lot of scatter of the points above and below the regression line, although the tendency of the relationship is clearly visible. The multiple R-squared value measures how well the fitted model explains the data and how strong the relation is: a value of about 0.19 means the model explains about 19% of the variation around the mean. So the model is not perfect, but may still be useful. To judge this further, one has to take a look at the residuals.

library(ggplot2)
# plotting the two variables against each other with aesthetic mapping
p1 <- ggplot(students2014, aes(x = attitude, y = points, col=gender))

# using points for visualization
p2 <- p1 + geom_point()

# add a regression line and plot
p3 <- p2 + geom_smooth(method = "lm")
p3
## `geom_smooth()` using formula 'y ~ x'
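The multiple R-squared mentioned above is simply \(1 - SS_{res}/SS_{tot}\). A self-contained check with simulated data (the course data gave an R-squared of about 0.19; the numbers below are from the simulation, not the course data):

```r
# Manual check that multiple R-squared equals 1 - SS_res / SS_tot.
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 4)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_manual, summary(fit)$r.squared)  # the two agree
```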

5. Diagnostics plot

The Residuals vs. Fitted values plot shows that the errors are scattered quite evenly above and below the zero line, which is a good sign for our model.

The Q-Q plot follows the line well for most of the inner quantiles and only diverges from it in the outer quantiles, so the normality assumption of the errors holds reasonably well.

The Residuals vs. Leverage plot can show us whether some points have too much leverage (influential outliers). In this case there is no outlier visible that would completely change the trend of the fit.

#par(mfrow = c(2,2)) #To put the images in the same figures. However, I prefer them a bit bigger right now.
plot(my_model3, which=c(1,2,5))


Chapter 3: Logistic Regression

Data wrangling

The data wrangling part is performed in the R-script create_alc.R in the data folder and the created dataset is saved as a .csv file (‘alc.csv’) in the data folder.

Data analysis

2. Reading the data and description of the dataset

The dataset we want to look at is joined from two datasets obtained from the following website: https://archive.ics.uci.edu/ml/datasets/Student+Performance , where it is also described in more detail. It covers the performance of secondary-education students of two Portuguese schools in mathematics and Portuguese. The datasets contain the students’ answers to different questions concerning working and personal life, social background, circumstances, etc.

Here, in our analysis, we will look in particular at alcohol consumption and its relation to other variables.

To start the data analysis part, we include the library dplyr (library(dplyr)) and read in the dataset.

library(dplyr)
alc <- read.csv("./data/alc.csv")

To get an overview on the dataset, let’s print out the names of the variables and get the dimension values. The meaning of the values can be found on the above-mentioned website.

colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"
dim(alc)
## [1] 382  36

In total, we have 382 observations and 36 variables in the joined alc dataset.

3. Relation of different values to high/low alcohol consumption

Choose four interesting variables from the data and formulate hypotheses about their relationship with alcohol consumption.

Hypothesis 1 - sex: There is a statistical difference between the sexes: male students are more likely to have high alcohol consumption than female students.

Hypothesis 2 - failures: The number of past class failures is related to alcohol consumption (more failures leading to higher alcohol consumption).

Hypothesis 3 - famrel: Good family relationships make high alcohol consumption less likely.

Hypothesis 4 - goout: The frequency of going out with friends affects alcohol consumption.

4. Numerical and graphical exploration of chosen variables

With gather() and ggplot() one can first get an overview bar plot of each variable.

library(tidyr); library(dplyr); library(ggplot2)
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Now we can continue by looking at the values interesting for the hypotheses stated above.

Hypothesis 1 - sex and alcohol consumption

alc %>% group_by(sex, high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      156
## 2 F     TRUE        42
## 3 M     FALSE      112
## 4 M     TRUE        72

The cross-tabulation supports Hypothesis 1: whereas 21.2% of the female students report high alcohol consumption, 39.1% of the male students do, which is almost double. To look at this graphically, let’s draw a bar plot.

library(ggplot2)
g2 <- ggplot(alc, aes(x = high_use, col=sex))
g2 + geom_bar()

The bar plot shows quite clearly that the ratio of high use to non-high use differs considerably between male and female students.
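The percentages quoted above come straight from the counts in the cross-tabulation:

```r
# counts copied from the cross-tabulation above
f_high <- 42; f_total <- 156 + 42   # female students
m_high <- 72; m_total <- 112 + 72   # male students

round(100 * f_high / f_total, 1)  # 21.2 % of female students report high use
round(100 * m_high / m_total, 1)  # 39.1 % of male students report high use
```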

Hypothesis 2 - failures: Let’s now investigate if one can see a clear relation between the failures and the alcohol consumption.

alc %>% group_by(high_use) %>% summarise(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   high_use count
##   <lgl>    <int>
## 1 FALSE      268
## 2 TRUE       114
alc %>% group_by(failures) %>% summarise(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   failures count
##      <int> <int>
## 1        0   334
## 2        1    24
## 3        2    19
## 4        3     5
alc %>% group_by(high_use, failures) %>% summarise(count=n())
## `summarise()` regrouping output by 'high_use' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups:   high_use [2]
##   high_use failures count
##   <lgl>       <int> <int>
## 1 FALSE           0   244
## 2 FALSE           1    12
## 3 FALSE           2    10
## 4 FALSE           3     2
## 5 TRUE            0    90
## 6 TRUE            1    12
## 7 TRUE            2     9
## 8 TRUE            3     3
alc %>% group_by(high_use, failures>0) %>% summarise(count=n())
## `summarise()` regrouping output by 'high_use' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   high_use [2]
##   high_use `failures > 0` count
##   <lgl>    <lgl>          <int>
## 1 FALSE    FALSE            244
## 2 FALSE    TRUE              24
## 3 TRUE     FALSE             90
## 4 TRUE     TRUE              24

As one can see, 5 of the 382 students have experienced 3 failures; of these 5, three report high alcohol consumption. This sample might, however, be too small to judge whether a real relationship exists. Let’s create a bar plot and a box plot to investigate further.

g3 <- ggplot(alc, aes(x=failures, col=high_use))
g3 + geom_bar()

g4 <- ggplot(alc, aes(x = sex, y = alc_use, col=failures>0))
g4 + geom_boxplot()

Looking at the bar plot, one might at least conclude that students with no past failures are less likely to report high alcohol consumption. However, the sample with at least one failure is much smaller than the one without, so it makes sense to combine the numbers for failures > 0. The box plot then gives a nice visualisation of how having experienced at least one failure relates to alcohol consumption. Again, the effect seems to be much stronger for the male students.

Hypothesis 3 - Family:

alc %>% group_by(famrel,high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   famrel [5]
##    famrel high_use count
##     <int> <lgl>    <int>
##  1      1 FALSE        6
##  2      1 TRUE         2
##  3      2 FALSE       10
##  4      2 TRUE         9
##  5      3 FALSE       39
##  6      3 TRUE        25
##  7      4 FALSE      135
##  8      4 TRUE        54
##  9      5 FALSE       78
## 10      5 TRUE        24
g5 <- ggplot(alc, aes(x = high_use, y = famrel))
g5 + geom_boxplot() + ylab("quality of family relationships (from 1 - very bad to 5 - excellent)")

g6 <- ggplot(alc, aes(x=famrel, col=high_use))
g6 + geom_bar()

Both the box plot and the bar plot seem to show at least a trend in favour of the hypothesis: whereas the average quality of the family relationships is about 3.5 for students reporting high alcohol consumption, it is better (about 4.5) for students without signs of alcohol problems. The bar plot gives a somewhat clearer view of the individual family-relationship bins: the high-use fraction is larger for famrel = 2 or 3 than for 4 and 5. For famrel = 1 the sample is probably too small to judge.

Hypothesis 4 - going out

alc %>% group_by(goout,high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'goout' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   goout [5]
##    goout high_use count
##    <int> <lgl>    <int>
##  1     1 FALSE       19
##  2     1 TRUE         3
##  3     2 FALSE       84
##  4     2 TRUE        16
##  5     3 FALSE      103
##  6     3 TRUE        23
##  7     4 FALSE       41
##  8     4 TRUE        40
##  9     5 FALSE       21
## 10     5 TRUE        32
g7 <- ggplot(alc, aes(x = high_use, y = goout, col=sex))
g7 + geom_boxplot() + ylab("going out with friends (from 1 [very low] to 5 [very high]")

g8 <- ggplot(alc, aes(x=goout, col=high_use))
g8 + geom_bar()

Again, the numeric cross-tabulation as well as the plots support the hypothesis: going out with friends more often (a higher value) seems to have an impact on alcohol consumption.

5. Logistic regression

# find the model with glm()
m <- glm(high_use ~ sex + failures + famrel + goout, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + failures + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6492  -0.7464  -0.5207   0.8642   2.4194  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3922     0.6415  -3.729 0.000192 ***
## sexM          0.8976     0.2530   3.548 0.000388 ***
## failures      0.3339     0.2034   1.641 0.100696    
## famrel       -0.3971     0.1390  -2.857 0.004281 ** 
## goout         0.7751     0.1217   6.367 1.93e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 391.08  on 377  degrees of freedom
## AIC: 401.08
## 
## Number of Fisher Scoring iterations: 4

The model summary shows a high significance for sexM and goout, a medium significance for famrel, and no significant value for failures. One could fit the model again, dropping at least the failures variable. But first, let’s study the odds ratios (OR).

coef(m)
## (Intercept)        sexM    failures      famrel       goout 
##  -2.3922069   0.8976484   0.3338923  -0.3971296   0.7751449
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.09142769 0.02506625 0.3124735
## sexM        2.45382585 1.50205707 4.0576659
## failures    1.39639281 0.93864152 2.0963627
## famrel      0.67224693 0.51012773 0.8814055
## goout       2.17090677 1.72124671 2.7771652

The odds ratios tell us whether having a certain property (e.g. being male, going out often) is positively or negatively associated with the logical target variable (here: high alcohol consumption). An OR > 1 means a positive association, an OR < 1 a negative one.

The male sex has the strongest relationship to the binary target variable, with an odds ratio of 2.45 - that means the odds of a male student having high alcohol consumption are about 2.5 times those of a female student. “Going out” is also positively associated with high alcohol consumption, whereas “family relation” has a negative association (a high quality of family relationships makes low alcohol consumption more likely). These ORs thus support the above-mentioned hypotheses.
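Note that an odds ratio multiplies the odds, not the probability; converting odds back to a probability uses p = odds / (1 + odds). A small sketch using the coefficients of the first model quoted above, for a hypothetical student (the chosen famrel and goout values are arbitrary illustrations):

```r
# Hypothetical student: female, no failures, famrel = 4, goout = 3,
# using the coefficients of the fitted logistic model quoted above.
log_odds_f <- -2.3922 + 0.8976 * 0 + 0.3339 * 0 - 0.3971 * 4 + 0.7751 * 3
log_odds_m <- log_odds_f + 0.8976   # identical student, but male

odds_f <- exp(log_odds_f)
odds_m <- exp(log_odds_m)

odds_m / odds_f                       # exp(0.8976) ~ 2.45, the OR for sexM
c(p_female = odds_f / (1 + odds_f),   # odds -> probability conversion
  p_male   = odds_m / (1 + odds_m))   # probabilities differ by less than 2.45x
```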

6. Predictive power of Model

As failures did not have a significant relationship in my logistic regression model, I will drop this variable, fit the model again, print out the summary, and calculate the ORs and confidence intervals:

m2 <- glm(high_use ~ sex + famrel + goout, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5836  -0.7707  -0.5316   0.8198   2.5474  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3550     0.6414  -3.672 0.000241 ***
## sexM          0.9342     0.2514   3.716 0.000202 ***
## famrel       -0.4118     0.1387  -2.969 0.002989 ** 
## goout         0.7971     0.1214   6.565  5.2e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 393.80  on 378  degrees of freedom
## AIC: 401.8
## 
## Number of Fisher Scoring iterations: 4
coef(m2)
## (Intercept)        sexM      famrel       goout 
##  -2.3549669   0.9341541  -0.4117517   0.7971166
OR <- coef(m2) %>% exp
CI <- confint(m2) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.09489664 0.02602013 0.3241397
## sexM        2.54505957 1.56349803 4.1969640
## famrel      0.66248876 0.50299888 0.8679153
## goout       2.21913298 1.76086023 2.8372779

As a next step, let’s check the predictive power of the new model by adding two new columns, probability and prediction, to the alc dataset. The probabilities are calculated using the predict() function on the model, and prediction is defined as TRUE if the probability is higher than 50%.

probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)

To get an impression of how good the model is, we can print the last ten rows of the columns we are interested in and look at the confusion matrix:

select(alc, sex, famrel, goout,  high_use, probability, prediction) %>% tail(10)
##     sex famrel goout high_use probability prediction
## 373   M      4     2    FALSE  0.18639810      FALSE
## 374   M      5     3     TRUE  0.25195331      FALSE
## 375   F      5     3    FALSE  0.11687357      FALSE
## 376   F      4     3    FALSE  0.16650200      FALSE
## 377   F      5     2    FALSE  0.05627990      FALSE
## 378   F      4     4    FALSE  0.30714360      FALSE
## 379   F      2     2    FALSE  0.17019623      FALSE
## 380   F      1     1    FALSE  0.12243164      FALSE
## 381   M      2     5     TRUE  0.85084788       TRUE
## 382   M      4     1     TRUE  0.09357856      FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   250   18
##    TRUE     63   51

To graphically visualize the actual values and the predictions, we can draw a plot of ‘high use’ vs. ‘probability’.

g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g + geom_point()

Furthermore, we can do another cross-tab of predictions vs. actual values, printing out the fractions instead and adding margins.

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65445026 0.04712042 0.70157068
##    TRUE  0.16492147 0.13350785 0.29842932
##    Sum   0.81937173 0.18062827 1.00000000

Accuracy of the model

The proportion of inaccurately classified individuals (the training error) can be calculated from the values in the confusion matrix: (18+63)/(250+18+63+51) = 81/382 ≈ 0.212. We get the same value by defining a loss function as below:

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2120419

The loss function gives us a value of about 0.21 (as calculated before), meaning that about 21% of the predictions on the training data are wrong. This is better than the value in the DataCamp example (about 26%), making this model slightly better.

7. Bonus: 10-fold cross-validation

To study the test set performance of the model one can use K-fold cross-validation. Here, we will try out 10-fold cross-validation, making use of the loss function defined above.

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K=10)
cv$delta[1] 
## [1] 0.2225131

The cv.glm function from the boot package can be used for this; K=10 sets the number of folds the cross-validation is performed on. The delta attribute stores the error value. It gives an error of about 0.22, which is lower than the 0.26 error of the DataCamp model, so this model performs a bit better.
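As a self-contained sketch of the same procedure (using the built-in mtcars data instead of alc, with a hypothetical model of the transmission type am, rather than the m2 model from above):

```r
library(boot)

# mean prediction error, defined as in the chapter
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# illustrative logistic model: transmission type from weight and horsepower
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

set.seed(42)  # cv.glm partitions the data randomly
cv <- cv.glm(data = mtcars, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]   # estimated out-of-sample prediction error
```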


Chapter 4: Clustering and classification

1-2. First look at the Boston dataset

This chapter focuses on clustering and classification, and the Boston dataset will be the dataset to be explored. The dataset is included in the MASS package, so it can be loaded directly.

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

To get a first impression on what dataset we are dealing with, let’s look at its dimensions (dim(Boston)) and structure (str(Boston)) .

## [1] 506  14
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset consists of 506 observations of 14 variables describing housing values in suburbs of Boston, including the per capita crime rate by town (crim), the proportion of non-retail business acres per town (indus), the nitrogen oxides concentration (nox) and the average number of rooms per dwelling (rm). Two of the variables (chas - Charles River dummy variable and rad - index of accessibility to radial highways) are of integer type, the others are numeric. Further information and the definitions of the different variables can be found at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.

3. Graphical overview of the data and variable summaries

Let’s see if a pairs() plot can show us some interesting things:

A summary (summary(Boston)) can give us more insight:

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Neither the pairs plot nor the summary is easy to interpret at first glance. Another option is to use a correlation plot - that is, to graphically display the correlation between the different variables.

library(corrplot)
library(tidyverse)
cor_matrix <- cor(Boston) %>% round(digits=2)
corrplot.mixed(cor_matrix, number.cex = .7)

In this plot we can nicely see which variables are correlated the most by looking at the colours and diameters of the circles. Circles with a darker shade of blue have a correlation value close to 1; comparing with the pairs plot above shows that those pairs have a linear tendency (e.g. nox and indus, denoting “nitrogen oxides concentration” and “proportion of non-retail business acres per town”). Variables whose relationship is visualised by a very small, nearly white circle are the ones with no significant relationship (e.g. rm and black - “average number of rooms per dwelling” vs. a measure of the black population in the town), which is also visible in the pairs plot. Using corrplot.mixed additionally provides the numerical correlation values.
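The strongest pairs can also be picked out programmatically instead of by eye; a small sketch on the same correlation matrix:

```r
library(MASS)  # for the Boston dataset

cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each variable pair only once
top <- order(abs(cm), decreasing = TRUE, na.last = NA)[1:3]
idx <- arrayInd(top, dim(cm))

# the three most strongly correlated variable pairs
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           r    = round(cm[top], 2))
```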

4. Scaling the data, crime rate, preparing train and test sets

Scaling the data

To scale the data the scale() function can be used on the Boston dataset. Below a summary of the scaled data:

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

One can see that as a result of the scaling the values of each variable are now distributed around a mean of 0 (with standard deviation 1), running from negative to positive. For further investigation we also save the scaled data in data frame format and look at the first six lines:

boston_scaled <- as.data.frame(boston_scaled) # save data in a dataframe
head(boston_scaled)
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582
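scale() centres each column by its mean and divides by its standard deviation; we can verify this by hand for one column, e.g. crim:

```r
library(MASS)
data(Boston)

boston_scaled <- as.data.frame(scale(Boston))

# manual standardization of one column: (x - mean(x)) / sd(x)
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, boston_scaled$crim)  # TRUE
```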

Categorical variable for the crime rate

The per capita crime rate by town is stored in the dataset as the continuous variable crim. This continuous variable should now be changed into a categorical one, using quantiles as break points. Let’s look at the quantiles first with summary(boston_scaled$crim) and then define our bins:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins <- quantile(boston_scaled$crim)

The new factor variable crime will hold the crime data, categorized by “low”, “med_low”, “med_high” and “high”.

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

Let’s now drop the old crim variable from the boston_scaled dataframe and add the new crime variable in its place.

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Divide the data set into training and test sets

To divide the dataset into training and test sets we first need the number of observations, obtained with nrow(). A common partition is to use 80% of the data for training and 20% for testing, which we will do here. With sample() we can randomly choose the required share of row indices and store them in a variable. Then we can build our train and test sets from the randomly chosen rows.

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
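Since sample() draws a random partition, rerunning this chunk gives a different split each time. A minimal reproducible variant (illustrative, on the plain Boston data, using set.seed and an explicit floor()):

```r
library(MASS)
data(Boston)

set.seed(2020)                       # fix the RNG so the split is reproducible
n   <- nrow(Boston)
ind <- sample(n, size = floor(n * 0.8))
train <- Boston[ind, ]
test  <- Boston[-ind, ]
c(train = nrow(train), test = nrow(test))  # 404 and 102 rows
```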

A table overview on the crime categories of the train and test sets gives us a first insight into the distribution amongst the different categories.

traincrime <- table(train$crime)
testcrime <- table(test$crime)
traincrime
## 
##      low  med_low med_high     high 
##      102       96       98      108
testcrime
## 
##      low  med_low med_high     high 
##       25       30       28       19

5. Linear Discriminant Analysis

LDA Fit on the crime training set

Linear Discriminant Analysis can be used as a classification method to find linear combinations of the variables that separate the target variable classes. It is fit with the function lda(), which takes as input a formula (e.g. target ~ x1 + x2 + …) and the dataset. target ~ . means that all other variables in the dataset are used as predictors.

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2376238 0.2425743 0.2673267 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.02060578 -0.9282113 -0.11793298 -0.8733936  0.4315140 -0.9230130
## med_low  -0.09978273 -0.3723399  0.05576262 -0.6139579 -0.1388989 -0.3923673
## med_high -0.41548661  0.2537587  0.12941585  0.3383128  0.1105463  0.4561245
## high     -0.48724019  1.0149946 -0.01714665  1.0544272 -0.3945765  0.8002135
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8410313 -0.6913766 -0.7366847 -0.44725315  0.3838047 -0.78362389
## med_low   0.3906120 -0.5344476 -0.5262050 -0.04970884  0.3154788 -0.17748597
## med_high -0.3555123 -0.3970905 -0.2764631 -0.21465541  0.1080257 -0.02492083
## high     -0.8361555  1.6596029  1.5294129  0.80577843 -0.7702986  0.86810049
##                 medv
## low       0.54008236
## med_low   0.01523418
## med_high  0.17561537
## high     -0.66702063
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09446400  0.820847651 -0.96582703
## indus    0.13712959 -0.510616770  0.13334400
## chas    -0.05832506 -0.050540710  0.16767883
## nox      0.23880572 -0.570161438 -1.31665726
## rm      -0.07379480 -0.150250320 -0.21298121
## age      0.30576850 -0.455826365 -0.15658342
## dis     -0.06709117 -0.532099958  0.23301883
## rad      3.33897765  1.031374320  0.24620265
## tax      0.09667025 -0.102060062  0.29121988
## ptratio  0.13077399  0.066338044 -0.22293465
## black   -0.13655882 -0.002331274  0.06768378
## lstat    0.15866833 -0.249789900  0.32366804
## medv     0.16726719 -0.397575572 -0.19876941
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9510 0.0377 0.0113

The print of the fit shows the result of the LDA on the training set, with the output listing the three extracted discriminant functions LD1, LD2 and LD3; LD1 has by far the highest proportion of trace at about 0.95. We get three discriminant functions because we have four groups (“low”, “med_low”, “med_high”, “high”) and the number of discriminants is always one less than the number of groups.
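As a self-contained sketch of the same workflow, here is lda() on the built-in iris data (three species, hence two discriminants):

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
fit$svd^2 / sum(fit$svd^2)        # proportion of trace for LD1 and LD2

pred <- predict(fit, iris)
mean(pred$class == iris$Species)  # training accuracy
```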

LDA (bi)plot

For the LDA biplot a function has to be created to also show the LDA arrows for the different variables in the plot. The code for this function was taken from the DataCamp exercise, which in turn refers to a Stack Overflow thread. Then the classes are stored in a numeric vector and plotted, together with the arrows, in an LDA (bi)plot.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2,col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

From the plot we can see that the variable rad has a major contribution to LD1. Looking at the definition of rad (“index of accessibility to radial highways”), it seems that accessibility to radial highways as a predictor increases the probability of being placed into the high category.

6. Predictions on the test data

Before going to the next step, let’s save the crime categories from the test set in correct_classes and then remove crime from the test dataset.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Now the lda.fit model can be used to predict the crime values of the before-created test dataset:

lda.pred <- predict(lda.fit, newdata = test)

A cross-tabulation with the original crime values (stored in correct_classes) can give a nice overview on how well the fitting worked.

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      12        2    0
##   med_low    3      18        9    0
##   med_high   2      10       15    1
##   high       0       0        1   18

Looking at the cross-tabulation one can see that for this dataset the high prediction is almost perfect and the med_high classification quite good, but the lower categories low and med_low are frequently confused, so there the model could be improved.
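The overall test-set accuracy can be read off the cross-tabulation by summing the diagonal; a sketch re-entering the printed counts as a matrix:

```r
# Cross-tabulation counts from the table above (rows = correct, cols = predicted)
tab <- matrix(c(11, 12,  2,  0,
                 3, 18,  9,  0,
                 2, 10, 15,  1,
                 0,  0,  1, 18),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

sum(diag(tab)) / sum(tab)  # overall accuracy, about 0.61
```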

7. Distances, K-Means and visualization of the clusters

First, let’s reload the Boston dataset and scale it to standardize the data, before comparing the distances.

# load MASS and Boston
library(MASS)
data('Boston')

boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2) # save data in a dataframe

Distances

Without specifying the method argument of the dist() function, the Euclidean distance is calculated. The method can be changed, e.g. to method="manhattan", which yields different distances.

dist_eu <- dist(boston_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
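The difference between the two metrics is easy to see on a toy example with the points (0, 0) and (3, 4):

```r
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # |3 - 0| + |4 - 0| = 7
```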

K-Means clustering

Let’s do K-Means clustering on the scaled dataset, starting with 4 cluster centers as a first try. The results can be inspected with pairs().

km <-kmeans(boston_scaled2, centers = 4)
pairs(boston_scaled2, col = km$cluster)

As it is difficult to read anything in the full pairs plot, let’s split it into four smaller parts:

pairs(boston_scaled2[1:4], col = km$cluster)

pairs(boston_scaled2[5:8], col = km$cluster)

pairs(boston_scaled2[9:11], col = km$cluster)

pairs(boston_scaled2[12:14], col = km$cluster)

I will stick to the [5:8] part of the dataset and investigate how changing the number of clusters (to 2, 3 and 5, instead of the initial 4) affects the result.

km <-kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2[5:8], col = km$cluster)

km <-kmeans(boston_scaled2, centers = 3)
pairs(boston_scaled2[5:8], col = km$cluster)

km <-kmeans(boston_scaled2, centers = 5)
pairs(boston_scaled2[5:8], col = km$cluster)

It seems difficult to decide which number of clusters is best. One can use the total within-cluster sum of squares (WCSS) to help with the decision: the optimal number of clusters is the point where the total WCSS drops sharply. Let’s investigate the behaviour of the total WCSS from 1 to 10 clusters.

set.seed(123) # use a certain seed for the initial cluster centers
k_max <- 10 # set the maximum number of clusters

twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss}) # calculate the total WCSS on the scaled data for each k

qplot(x = 1:k_max, y = twcss, geom = 'line')

The total WCSS falls steeply up to about 2 clusters, more slowly after that, and between 3 and 4 it even rises again. So we shouldn’t use more than 3 clusters, and 2 clusters is probably the optimal value.

Let’s rerun the K-means clustering with 2 clusters for the whole dataset.

km <-kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2, col = km$cluster)


Chapter 5: Dimensionality Reduction Techniques

1. First look at the new data and graphical overview

The new datasets to be looked at concern the Human Development Index and the Gender Inequality Index in different countries around the world. The data originates from the United Nations Development Programme and further information can be found here. Some preliminary data wrangling has been done and a combined dataset has been constructed, excluding some of the variables and removing some missing values. All mutations to the dataset are documented in the “create_human.R” file (in the data folder), where the new variable names are also defined.

Let’s start by including the new prepared dataset human and look at dimensions, summaries and a graphical overview.

library(dplyr)
human <- read.table('./data/human.csv')
human <- as.data.frame(human)
dim(human)
## [1] 155   8
head(human)
##             SecEduRat LabForceRat EduYearsExp LifeExp   GNI MatMortRat
## Norway      1.0072389   0.8908297        17.5    81.6 64992          4
## Australia   0.9968288   0.8189415        20.2    82.4 42261          6
## Switzerland 0.9834369   0.8251001        15.8    83.0 56431          6
## Denmark     0.9886128   0.8840361        18.7    80.2 44025          5
## Netherlands 0.9690608   0.8286119        17.9    81.6 45435          6
## Germany     0.9927835   0.8072289        16.5    80.9 43919          7
##             AdoBirthRate PercParl
## Norway               7.8     39.6
## Australia           12.1     30.5
## Switzerland          1.9     28.5
## Denmark              5.1     38.0
## Netherlands          6.2     36.9
## Germany              3.8     36.9

We have 155 observations for different countries (row-names) and 8 variables, consisting of the ratio of females vs. males having at least secondary education (SecEduRat), the ratio of females vs. males in the labour force (LabForceRat), the expected years of schooling (EduYearsExp), life expectancy at birth (LifeExp), Gross National Income per capita (GNI), Maternal Mortality ratio (MatMortRat), Adolescent birth rate (AdoBirthRate) and percentage of female representatives in parliament (PercParl).

Visualizations

First we have to include some libraries to perform the visualisations.

library(GGally)
library(corrplot)
library(tidyverse)

First, with ggpairs(), let’s look at a pairplot and afterwards with corrplot() at a visualization of the correlation matrix.

ggpairs(human)

cor(human) %>% corrplot.mixed(number.cex = .7, tl.cex=0.5)

Looking at both plots, we can see that the strongest correlation is between the life expectancy at birth and the maternal mortality ratio (0.86 in magnitude, a negative correlation). This is not too surprising: a country with a high life expectancy probably also has good facilities, e.g. in hospitals, and thus better conditions for giving birth, while a low life expectancy goes along with a high maternal mortality ratio. Another high correlation is between EduYearsExp and LifeExp (0.79), suggesting that in countries where life expectancy is higher, the education system is probably better as well. A further significant correlation (0.76) is between MatMortRat and AdoBirthRate. The adolescent birth rate is the birth rate per 1000 women aged 15-19 (description see here). Its high correlation with the maternal mortality ratio supports the statement on the above-mentioned WHO site that people giving birth early “are subject to higher risks or complications or even death during pregnancy and birth”.

Let’s look at these two variables as an example.

library(ggplot2)
g <- ggplot(human, aes(x = AdoBirthRate, y=MatMortRat))
g + geom_point()

The trend is quite visible.

2. PCA on the non-standardized data

A summary() of the data gives us once more information about the variables we are dealing with.

summary(human)
##    SecEduRat       LabForceRat      EduYearsExp       LifeExp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI           MatMortRat      AdoBirthRate       PercParl    
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

To gain more understanding of the correlations between the different variables we will use Principal Component Analysis (PCA) to extract the principal components of our data matrix and create a lower-dimensional representation from them. We will perform the PCA with the Singular Value Decomposition (SVD) method, which is what R’s prcomp() function uses. As a start, the PCA is performed on the non-standardized data and a biplot is drawn with the two leading principal components (PC1, PC2).

pca_human <- prcomp(human)
s <- summary(pca_human)
s # Look at the summary of the pca values
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
pca_pr <- round(100*s$importance[2, ], digits=1) #Round the values and print them as percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
biplot(pca_human, choices = 1:2,cex = c(0.6, 0.8),col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

The warning messages already tell us that we face a problem here: several arrows have essentially zero length, so their angles cannot be drawn, and the variables cannot be compared without standardizing them. Most of the data is accumulated in one corner of the plot. We can see that the variability is captured almost entirely by the first principal component (100% after rounding), whereas all the other PCs are at 0. GNI has the highest contribution here, as it is the only arrow visible. Clearly, a standardization of the variables is needed.
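The same effect can be demonstrated on the built-in mtcars data (illustrative, not the human data): unscaled, the large-variance displacement column dominates PC1, while scaling spreads the variance over several components.

```r
# share of total variance captured by PC1
pc1_share <- function(x, ...) summary(prcomp(x, ...))$importance[2, 1]

c(unscaled = pc1_share(mtcars),
  scaled   = pc1_share(mtcars, scale. = TRUE))
```

The unscaled PC1 share is far larger than the scaled one, for the same reason GNI swamps the unscaled human PCA above.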

3. PCA on the standardized data

Let’s standardize the data now first and then look again at the importance of the different principal components.

human_std <- scale(human)
summary(human_std)
##    SecEduRat        LabForceRat       EduYearsExp         LifeExp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI            MatMortRat       AdoBirthRate        PercParl      
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

The variables are now standardized, as can be seen above. Let’s see how this affects the PCA.

pca_human_std <- prcomp(human_std)

s2 <- summary(pca_human_std)
s2 # Look at the summary of the pca values
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
pca_pr_std <- round(100*s2$importance[2, ], digits=1) #Round the values and print them as percentages of variance
pca_pr_std
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

Now the contributions of the different PCs make more sense: the first principal component captures 53.6% of the variability, the second one 16.2%, followed by PC3 (9.6%), PC4 (7.6%), PC5 (5.5%), PC6 (3.6%), PC7 (2.6%) and PC8 (1.3%). Now let’s make a new plot, stating the importance of the first two components in the axis labels.

# Create a labeling object for the x- and y axis
pc_lab <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")

# Draw the new biplot with scaled values and labels for the axes
biplot(pca_human_std, cex = c(0.6, 0.8), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

4. Interpretations of the first two principal component dimensions

We can now differentiate between the different arrows and comment on their relationships. We saw before that AdoBirthRate and MatMortRat have a significant correlation; in the PCA biplot of the standardized dataset the angle between those arrows is very small, which also supports a high correlation. Moreover, the angle between these variables and the PC1 axis is quite small and almost orthogonal to PC2, meaning they correlate strongly with the first principal component but not with the second. The only two variables which seem to have a major impact on PC2 are PercParl and LabForceRat.

Based on the biplot drawn after the PCA, I would say that the first two principal component dimensions describe the data quite well, as each variable seems to correlate well with either the first or the second component. Looking at the standard deviations of the different components (which could be visualised in a scree plot), one can see that only PC1 and PC2 are above 1, whereas PC3 is already below 1. This also suggests that the first two components could be enough to describe the correlations. Together they account for almost 70% of the variance in the dataset.
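A scree plot of this kind is easy to draw with plot() on a prcomp object; a self-contained sketch on the built-in iris measurements:

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)
plot(pca, type = "l")  # scree plot: variances of the components
pca$sdev^2             # the eigenvalues; large values dominate the representation
```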

5. Exploring the tea dataset

Next, we will explore the tea dataset from the package FactoMineR. The data comes from a questionnaire on tea-drinking (e.g. where, when, what and some personal data). More information on the dataset can be found here. First, we have to import the FactoMineR package and load the tea dataset.

library(FactoMineR)
library(ggplot2)
library(dplyr)

data(tea)

dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Looking at the dimensions and structure of the dataset we can see it is made up of 300 observations (people interviewed) and 36 variables. Using a pairs or ggpairs plot on the complete dataset would be too much. (Tried it, takes forever, and you can’t decipher anything.;)) So, let’s look directly at a subset of the data.

I will keep the same subset as was examined in the DataCamp exercise, look at the summary and visualize the variables with bar plots.

#library(FactoMineR); library(dplyr); library(ggplot2); library(tidyr)
# column names to keep in the new dataset tea_time
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))

summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Taking a quick glance at the data, one can see that most of the people interviewed prefer Earl Grey, drink their tea plain (apart from maybe sugar), prefer tea bags and don’t drink tea at lunch. About half of the interviewees take tea without sugar, the other half with. This is just a first quick look at the summary data - the kind of tea might, for example, have an impact on whether one takes it with milk, sugar, lemon etc. or without. So let’s continue with a Multiple Correspondence Analysis (MCA) to dig deeper into this qualitative dataset.

Multiple Correspondence Analysis (MCA)

MCA is a method to analyze qualitative data. It can be used to find and explore patterns and (similar to PCA) to reduce the dimensionality. Calling summary() on the fitted MCA model provides statistical outputs, such as the eigenvalues and importance of the dimensions, the v.test values, and the associations (eta2) of the variables with the dimensions.

mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"))

plot(mca, invisible=c("ind"),habillage = "quali")

The summary gives us values for the different dimensions - in the MCA biplot, the first two dimensions (Dim 1: 15.24%, Dim 2: 14.23% of the variance) are shown. Using the habillage argument conveniently colours the categories according to their variables (“How”: alone/lemon/milk/other, “how”: tea bag/tea bag+unpackaged/unpackaged, “lunch”: lunch/Not.lunch, “sugar”: No.sugar/sugar, “Tea”: black/Earl Grey/green and “where”: chain store/chain store+tea shop/tea shop).

Let’s investigate further: one aspect that can be explored is the relation between the different tea types and sugar. Both No.sugar and sugar are quite close to Earl Grey, but sugar is farther away from black and green tea, suggesting that people who drink Earl Grey might add sugar more often than people drinking black or green tea. Another thing quite visible is how strongly the different categories are associated with Dim 1 and Dim 2: other apparently has the strongest association with Dim 2 (followed by chain store+tea shop, tea bag+unpackaged and green), whereas unpackaged and tea shop are associated more with Dim 1, but also noticeably with Dim 2.
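The proximities read off the biplot can also be checked numerically. A small sketch (assuming the mca object fitted above; indexing by category labels assumes FactoMineR uses them as row names of the coordinate matrix, which is its usual behaviour):

```r
# Pairwise distances between category points in the first two MCA dimensions;
# smaller distances back up the "closeness" read off the biplot
coords <- mca$var$coord[, 1:2]
d <- as.matrix(dist(coords))
round(d["sugar", c("black", "Earl Grey", "green")], 2)
```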


Chapter 6: Analysis of Longitudinal Data

1. Analysis of the RATS dataset

#include libraries
library(ggplot2)
library(dplyr)
library(tidyr)

First look at the RATS dataset

We have wrangled the original wide RATS dataset from here into a long version and saved it in the data folder (the data wrangling is described in the script meet_and_repeat.R in the data folder). The dataset covers 16 rats, each belonging to one of three groups. Depending on the group membership they get a different diet, and their weights are measured at several time points over a total timespan of 64 days.

First, let’s read in the original dataset RATS and the long, further prepared version RATSL, and get an overview of what we are dealing with.

RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = T)
RATSL <- read.csv("./data/ratsl.csv") # Read in the long version of the RATS dataset
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

Using head() and tail() one can easily see the differences between the wide (RATS) and the long version (RATSL) of the data. Whereas in RATS the different measurement days are all separate variables, in RATSL they are combined into the days variable, which has the measurement days as categories. The data table thus becomes narrower but longer (dimensions: 176 rows, 5 columns), which makes it a better format for the following analysis.

head(RATS)
##   ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1  1     1 240 250  255  260  262  258  266  266  265  272  278
## 2  2     1 225 230  230  232  240  240  243  244  238  247  245
## 3  3     1 245 250  250  255  262  265  267  267  264  268  269
## 4  4     1 260 255  255  265  265  268  270  272  274  273  275
## 5  5     1 255 260  255  270  270  273  274  273  276  278  280
## 6  6     1 260 265  270  275  275  277  278  278  284  279  281
head(RATSL)
##   ID Group days weight Time
## 1  1     1  WD1    240    1
## 2  2     1  WD1    225    1
## 3  3     1  WD1    245    1
## 4  4     1  WD1    260    1
## 5  5     1  WD1    255    1
## 6  6     1  WD1    260    1
tail(RATSL)
##     ID Group days weight Time
## 171 11     2 WD64    472   64
## 172 12     2 WD64    628   64
## 173 13     3 WD64    525   64
## 174 14     3 WD64    559   64
## 175 15     3 WD64    548   64
## 176 16     3 WD64    569   64

The variables of RATSL are as follows:

  • ID: factor, the ID of the rat within its group
  • Group: factor, the nutrition study group the rat belongs to
  • days: string, the measurement day in its string form (e.g. WD1)
  • weight: integer, the weight of the rat with the given ID at the given Time
  • Time: integer, integer version of the measurement day
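The wide-to-long step done in meet_and_repeat.R can be illustrated with a toy example (hypothetical data in the same shape as RATS, using gather() from tidyr as elsewhere in this chapter):

```r
library(dplyr)
library(tidyr)

# Toy wide data: one column per measurement day, as in RATS
toy <- data.frame(ID = 1:2, Group = c(1, 2), WD1 = c(240, 410), WD8 = c(250, 415))

# Gather the WD* columns into a days/weight pair and extract the integer day
toy_long <- toy %>%
  gather(key = days, value = weight, starts_with("WD")) %>%
  mutate(Time = as.integer(substring(days, 3)))

toy_long # 4 rows: the two WD1 measurements, then the two WD8 measurements
```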

Visualisations

Let’s first try a plot of the weight against the time, differentiating the different nutrition study groups by color.

# Plot the RATSL data
ggplot(RATSL, aes(x = Time, y = weight, linetype = ID)) +
  geom_line(aes(color=ID)) +
  scale_linetype_manual(values = 1:16) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(limits = c(min(RATSL$weight), max(RATSL$weight)))

It is clearly visible that the rats in nutrition study group 1 have the lowest starting weights. The rats in group 2 seem to show the steepest rise in weight. The tracking phenomenon - individuals that start heavier tend to stay heavier - might become more visible when plotting standardized data.

Standardization To standardize the values at each measurement occasion, we subtract the relevant occasion mean from each original observation and divide by the corresponding standard deviation: \[standardised(x) = \frac{x- mean(x)}{sd(x)} \] The standardized values are added as a new column stdweight to our RATSL dataset. We can then explore a plot of the now standardized data.

# Standardise the variable weight
RATSL <- RATSL %>% group_by(Time) %>% mutate(stdweight = (weight - mean(weight))/sd(weight)) %>% ungroup()

# Glimpse at the RATSL data with the new column stdweight
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2…
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,…
## $ days      <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD…
## $ weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555,…
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8,…
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001,…
# Plot again with the standardised weight
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line(aes(color=ID)) +
  scale_linetype_manual(values = 1:16) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

The mutate() call above does exactly the same thing as scale() applied within each Time group, but writing out the formula given above shows clearly what happens.
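That equivalence is easy to verify on a small vector (base R only; the numbers are made up):

```r
# Manual standardization vs. the built-in scale()
x <- c(240, 225, 245, 260, 255)
manual <- (x - mean(x)) / sd(x)
builtin <- as.vector(scale(x)) # scale() returns a matrix, so drop the dims
all.equal(manual, builtin)
## [1] TRUE
```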

Compared to the non-standardized plot, one can see that the rising character of the individual lines is not so clear anymore. Some lines even show a negative slope, others are almost horizontal. The observations of group 1 are quite stable.

Summary Graphs

The differences amongst the different groups can be explored further by employing summary graphs.

This can be achieved by plotting, for each of the three nutrition study groups, the mean weight at the different times, and adding the standard error of the mean as error bars: \[ se = \frac{sd(x)}{\sqrt{n}} \]

First some preparations:

# Number of measurement times, baseline (day 1) included
n <- RATSL$Time %>% unique() %>% length()

# Summary data with mean and standard error of weight by Group and Time
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(weight), se = sd(weight)/sqrt(n) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.3…

As the summary graph shows, the mean weight rises over time in all three groups, and there is a huge difference in weight between group 1 and the two other groups.

# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line(aes(color=Group)) +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1", color=Group), width=0.3) +
  theme(legend.position = "top") +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")

Box Plots

Box plots can be another alternative for visualisation of the data.

ggplot(RATSL, aes(x = factor(Time), y = weight, fill = Group)) +
  geom_boxplot(position = position_dodge(width = 0.9)) +
  theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  theme(legend.position = "top") +
  scale_x_discrete(name = "Time")

In this box plot diagram we can see the outliers in the different groups at every measurement time. Still, it is not clear whether this visualisation makes much sense, as there are quite few individuals per group (8 in Group 1, 4 in Group 2 and 4 in Group 3). One possibility (in order to base each box on more data points) is to draw a box plot of the mean weight over time for each group.

RATSLSUM <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(weight) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSLSUM)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
ggplot(RATSLSUM, aes(x = Group, y = mean)) +
  geom_boxplot() +
  theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(weight), time > 1 day")

As expected from the previous plot, we can still see the outliers in the three different groups. Next, let’s remove them from the box plot. Let’s start by removing just the most obvious outlier from Group 2, which can be rejected with a cut on the mean weight value.

RATSLSUM1 <- RATSLSUM %>% filter(mean < 570)

glimpse(RATSLSUM1)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
ggplot(RATSLSUM1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(weight), time > 1 day")

Now let’s get rid of all three visible outliers. We can identify the individuals to filter out from the summary plots and their legends. The most obvious outlier is #12 from Group 2, followed by #13 from Group 3 and #2 from Group 1.

RATSLSUM1 <- RATSLSUM %>% filter(ID !=2& ID !=12 & ID !=13)

glimpse(RATSLSUM1)
## Rows: 13
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID    <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean  <dbl> 263.2, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 457.5, …
ggplot(RATSLSUM1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(weight), time > 1 day")

Without the outliers, the box plots are quite sharp and less skewed: there is only little variation within the groups, but quite some between them. But again, the sample size is quite limited, so one could question whether this visualisation makes sense - removing three individuals from a set of 16 (and, for Groups 2 and 3, 1 out of 4 per group) is already quite a fraction.

Perform T-Test and Anova test on the data

T-test

To get more statistical information on the relationships between the different groups, one can do a t-test. As a t-test only compares two groups, I will first cut the data down to the corresponding pairs of groups.

The t-tests confirm the group differences through very low p-values.

# CUTS ON THE GROUPS (outliers excluded)
RATSCUT23 <- RATSLSUM1 %>% filter(Group != 1) # keep Groups 2 and 3
RATSCUT13 <- RATSLSUM1 %>% filter(Group != 2) # keep Groups 1 and 3
RATSCUT12 <- RATSLSUM1 %>% filter(Group != 3) # keep Groups 1 and 2

t.test(mean ~ Group, data = RATSCUT23, var.equal = TRUE) # T-test on Group 2 and Group 3
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2 
##        268.7571        452.4000
t.test(mean ~ Group, data = RATSCUT13, var.equal = TRUE) # T-test on Group 1 and Group 3
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -77.715, df = 8, p-value = 8.377e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -277.5065 -261.5125
## sample estimates:
## mean in group 1 mean in group 3 
##        268.7571        538.2667
t.test(mean ~ Group, data = RATSCUT12, var.equal = TRUE) # T-test on Group 1 and Group 2
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2 
##        268.7571        452.4000

To check whether the significance changes much when the outliers are removed, let’s do the same tests including the outliers.

# CUTS ON THE GROUPS (outliers included)
RATS23 <- RATSLSUM %>% filter(Group != 1) # keep Groups 2 and 3
RATS13 <- RATSLSUM %>% filter(Group != 2) # keep Groups 1 and 3
RATS12 <- RATSLSUM %>% filter(Group != 3) # keep Groups 1 and 2

t.test(mean ~ Group, data = RATS23, var.equal = TRUE) # T-test on Group 2 and Group 3
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -9.0646, df = 10, p-value = 3.88e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -277.5345 -168.0155
## sample estimates:
## mean in group 1 mean in group 2 
##         265.025         487.800
t.test(mean ~ Group, data = RATS13, var.equal = TRUE) # T-test on Group 1 and Group 3
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3 
##         265.025         527.500
t.test(mean ~ Group, data = RATS12, var.equal = TRUE)# T-test on Group 1 and Group 2
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -9.0646, df = 10, p-value = 3.88e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -277.5345 -168.0155
## sample estimates:
## mean in group 1 mean in group 2 
##         265.025         487.800

We still get quite low p-values, although they are now some orders of magnitude higher. Especially the comparison of group 1 with group 3 gives a very small p-value.

Anova test

Another test that can be performed on the dataset is an analysis of variance (ANOVA). For it, we add the baseline measurement of the outcome variable as a covariate, so that we can adjust for it. We will use the WD1 values from the original RATS dataset as the baseline.

# Add the baseline from the original data as a new variable to the summary data
RATSLSUM2 <- RATSLSUM %>% mutate(baseline = RATS$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSLSUM2)
summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSLSUM2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.905  -4.194   2.190   7.577  14.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.16375   21.87657   1.516   0.1554    
## baseline     0.92513    0.08572  10.793 1.56e-07 ***
## Group2      34.85753   18.82308   1.852   0.0888 .  
## Group3      23.67526   23.25324   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.992 
## F-statistic: 622.1 on 3 and 12 DF,  p-value: 1.989e-13
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One can see that the baseline weight from RATS is strongly related to the weight values at the later times, whereas the group variable has a higher p-value (about 0.076) and is thus not significant at the 5% level. The weight gain therefore seems to depend more on the starting weight than on the provided diet.

2. Analysis of the BPRS dataset

First look at the BPRS dataset

We have wrangled the original wide BPRS dataset from here into a long version and saved it in the data folder (the data wrangling is described in the script meet_and_repeat.R in the data folder). The dataset records a psychiatric rating (BPRS, brief psychiatric rating scale) for 40 individuals over a timespan of several weeks. The individuals are given two different medications and thus belong to either treatment group 1 or 2.

The variables are as follows:

  • treatment: factor, tells whether the individual belongs to treatment group 1 or 2
  • subject: factor, the ID of the individual within the treatment group: ID 1 with treatment 1 is not the same individual as ID 1 with treatment 2!
  • weeks: string, the week number in its string form (e.g. week0)
  • bprs: integer, the bprs value for the individual with the given subject ID, treatment group and week
  • week: integer, integer version of the week number
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T) # Original BPRS data
BPRSL <- read.csv("./data/bprsl.csv") %>% as.data.frame() # Read in the long version of the BPRS dataset
BPRSL$treatment <- factor(BPRSL$treatment) # factor the categorical values
BPRSL$subject <- factor(BPRSL$subject)

As with the RATS and RATSL datasets, we can see the differences between the wide and the long form by glimpsing at the data and looking at it with head() and tail().

glimpse(BPRS)
## Rows: 40
## Columns: 11
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ week0     <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week1     <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68,…
## $ week2     <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65,…
## $ week3     <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49,…
## $ week4     <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36,…
## $ week5     <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32,…
## $ week6     <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27,…
## $ week7     <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30,…
## $ week8     <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37,…
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
head(BPRS)
##   treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1         1       1    42    36    36    43    41    40    38    47    51
## 2         1       2    58    68    61    55    43    34    28    28    28
## 3         1       3    54    55    41    38    43    28    29    25    24
## 4         1       4    55    77    49    54    56    50    47    42    46
## 5         1       5    72    75    72    65    50    39    32    38    32
## 6         1       6    48    43    41    38    36    29    33    27    25
head(BPRSL); tail(BPRSL)
##   treatment subject weeks bprs week
## 1         1       1 week0   42    0
## 2         1       2 week0   58    0
## 3         1       3 week0   54    0
## 4         1       4 week0   55    0
## 5         1       5 week0   72    0
## 6         1       6 week0   48    0
##     treatment subject weeks bprs week
## 355         2      15 week8   37    8
## 356         2      16 week8   22    8
## 357         2      17 week8   43    8
## 358         2      18 week8   30    8
## 359         2      19 week8   21    8
## 360         2      20 week8   23    8

The different week variables are now gathered together into one weeks variable, and another column with the integer values of the weeks is appended.
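Extracting the integer week from the weeks strings, as done in the wrangling script, can be sketched in base R:

```r
# Strip the "week" prefix from the category labels and convert to integer
weeks <- c("week0", "week1", "week8")
as.integer(sub("^week", "", weeks))
## [1] 0 1 8
```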

Visualisation

Let’s try looking at the data with a bprs against week plot, numbering the different subjects.

# Check the dimensions of the data
dim(BPRSL)
## [1] 360   5
# Plot the BPRSL data
ggplot(BPRSL, aes(x= week, y = bprs, group = treatment)) + 
  geom_text(aes(label = subject, color=treatment)) +
  scale_x_continuous(name = "Week", breaks = seq(0, 8, 1)) + 
  scale_y_continuous(name = "bprs") +
  theme_bw() + 
  theme(legend.position = "top") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

As we can see in the above plot, each subject ID appears twice, because the numbers 1-20 were given both to individuals in treatment group 1 and to individuals in group 2. Thus, it makes sense to differentiate between the treatment groups. The same plot can also be made without the subject IDs, to see whether some group-specific patterns emerge.

ggplot(BPRSL, aes(x= week, y = bprs, group = treatment)) + 
  geom_point(aes(color = treatment)) +
  scale_x_continuous(name = "Week" , breaks = seq(0, 8, 1)) + 
  scale_y_continuous(name = "bprs") +
  theme_bw() + 
  theme(legend.position = "top") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

The plot below makes it easier to track the individual subjects and to distinguish between the groups. It is a bit less “chaotic” this way.

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line(aes(color=subject)) +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Linear Regression for the BPRSL dataset

As the visualisations are still quite chaotic, as a next step we fit a linear regression model, using bprs as the response and week and treatment as explanatory variables.

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The fit shows a highly significant linear relationship between week and bprs (p &lt; 2e-16). However, there is no significant difference between treatment groups 1 and 2 (p ≈ 0.66).
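The coefficient table can also be accessed programmatically rather than read off the printed summary. A minimal sketch on hypothetical toy data (the same calls apply to the BPRS_reg model above):

```r
# Toy data mirroring the structure of lm(bprs ~ week + treatment)
set.seed(1)
toy <- data.frame(week = rep(0:8, 2),
                  treatment = factor(rep(1:2, each = 9)))
toy$bprs <- 46 - 2.3 * toy$week + rnorm(18, sd = 5)

fit <- lm(bprs ~ week + treatment, data = toy)

# Full coefficient matrix: estimates, std. errors, t values, p-values
coefs <- summary(fit)$coefficients
coefs[, "Pr(>|t|)"]   # p-value for each term
```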

Random Intercept Model

The model above still ignores the repeated-measures structure and treats every bprs observation as independent, which is probably far from the truth. So we will now look at more advanced models. The random intercept model allows each subject to have its own intercept in the regression fit. In the first random intercept model we again use week and treatment as explanatory variables.

For the fit the lme4 package can be used: lmer() again takes a model formula as its first argument, but in addition to the fixed effects it also includes random effects, which are written in parentheses with a | separating the model term from the grouping variable (the 1 refers to the intercept).

# access library lme4
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
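As an extra interpretation step (my addition, using the variance components reported in the summary above), one can compute the intraclass correlation, i.e. the share of the total variance attributable to between-subject differences:

```r
# Variance components from the random intercept model summary above
var_subject  <- 47.41   # between-subject (intercept) variance
var_residual <- 104.21  # residual variance

# Intraclass correlation: proportion of total variance due to subjects
icc <- var_subject / (var_subject + var_residual)
round(icc, 3)  # about 0.313
```

So roughly a third of the variation in bprs comes from stable differences between subjects, which confirms that ignoring the repeated measures was a poor assumption.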

Random Intercept and Random Slope Model

If we now also include week as a random effect, the individual subjects are additionally allowed to differ in slope, so that the model can account for subject-specific changes over time.

# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

One can compare the two models with a likelihood-ratio test using anova().

# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the \(\chi^2\) test of the two fits (p ≈ 0.026), we can see that BPRS_ref1 is better than BPRS_ref.
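The \(\chi^2\) statistic in the ANOVA table is simply the difference in the deviances (equivalently, twice the difference in log-likelihoods), compared against a chi-square distribution with 2 degrees of freedom (7 vs. 5 parameters). Recomputing it from the values reported above:

```r
# Deviances from the two model summaries above
dev_ref  <- 2738.7  # random intercept model (5 parameters)
dev_ref1 <- 2731.4  # random intercept + slope model (7 parameters)

# Likelihood-ratio statistic and its p-value on 2 degrees of freedom
chisq   <- dev_ref - dev_ref1
p_value <- pchisq(chisq, df = 2, lower.tail = FALSE)
round(p_value, 3)  # close to the 0.026 reported by anova()
```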

Random Intercept and Random Slope Model with interaction

Finally, we let week and treatment interact, so that the two treatment groups are allowed to have different slopes over time.

# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again looking at the \(\chi^2\) test (p ≈ 0.075), BPRS_ref2 performs only marginally better than BPRS_ref1, but let’s take this model for another visualisation. We save the fitted values of this model in a vector, add it to our BPRSL data as another column, and then draw the same plot as above with the fitted data.

# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>% mutate(fitted = Fitted)


# draw the plot of BPRSL with the fitted values of bprs
ggplot(BPRSL, aes(x = week, y = fitted, linetype = subject)) + 
  geom_line(aes(color = subject)) +
  scale_linetype_manual(values = rep(1:10, times = 2)) +
  facet_grid(. ~ treatment, labeller = label_both) 

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line(aes(color=subject)) +
  scale_linetype_manual(values = rep(1:10, times=2)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Comparing with the plot of the observed values, one can see that the model is not perfect, but it captures the general trends.
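One simple way to put a number on how closely the fitted values track the observations is the root-mean-square error. A sketch on hypothetical observed/fitted vectors (with the real data one would use BPRSL$bprs and BPRSL$fitted):

```r
# Hypothetical observed values and model fits for illustration
observed   <- c(42, 39, 36, 48, 45, 41)
fitted_val <- c(43, 38, 37, 47, 46, 40)

# Root-mean-square error: smaller values indicate a closer fit
rmse <- sqrt(mean((observed - fitted_val)^2))
rmse  # 1 here
```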